# Reverse Experience Replay

## 1 Overview

Reverse Experience Replay (RER) was proposed by E. Rotinov [1] to cope with environments with delayed (sparse) rewards. Since many transitions carry no immediate reward, uniform random sampling is inefficient.

In RER, equally strided transitions are sampled, starting from the latest transition. The next batch contains the transitions one step older.

\begin{align}
B_1 &= \lbrace T_{t},\ T_{t-stride},\ \dots,\ T_{t-batch~size \times stride} \rbrace \\
B_2 &= \lbrace T_{t-1},\ T_{t-stride-1},\ \dots,\ T_{t-batch~size \times stride - 1} \rbrace \\
&\ \ \vdots
\end{align}

When the first sample index ($$t-i$$) becomes $$2 \times stride$$ steps older than the latest transition, it is reset to the latest transition.

| Parameter | Default | Description   |
|-----------|---------|---------------|
| `stride`  | `300`   | Sample stride |
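The index-selection rule above can be sketched in plain Python. This is a minimal illustration of the sampling pattern, not the cpprb implementation; the function name `rer_indices` and its signature are made up for this example.

```python
def rer_indices(latest, batch_size, stride, n_calls):
    """Sketch of RER batch-index selection.

    Each batch starts one step older than the previous one and samples
    equally strided indices; the start index resets to the latest
    transition after it has drifted 2*stride steps back.
    """
    batches = []
    shift = 0
    for _ in range(n_calls):
        start = latest - shift
        batches.append([start - i * stride for i in range(batch_size)])
        shift += 1
        if shift >= 2 * stride:  # start would be 2*stride old -> reset
            shift = 0
    return batches


print(rer_indices(latest=100, batch_size=3, stride=5, n_calls=2))
# first batch starts at index 100, second batch one step older at 99
```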

## 2 Example Usage

The usage of `ReverseReplayBuffer` is the same as that of the ordinary `ReplayBuffer`.

```python
import numpy as np

from cpprb import ReverseReplayBuffer

buffer_size = 256
obs_shape = 3
act_dim = 1
stride = 20

rb = ReverseReplayBuffer(buffer_size,
                         env_dict={"obs": {"shape": obs_shape},
                                   "act": {"shape": act_dim},
                                   "rew": {},
                                   "next_obs": {"shape": obs_shape},
                                   "done": {}},
                         stride=stride)

obs = np.ones(shape=(obs_shape))
act = np.ones(shape=(act_dim))
rew = 0
next_obs = np.ones(shape=(obs_shape))
done = 0

for i in range(500):
    rb.add(obs=obs, act=act, rew=rew, next_obs=next_obs, done=done)

    if done:
        # Together with resetting the environment, call ReplayBuffer.on_episode_end()
        rb.on_episode_end()

batch_size = 32
sample = rb.sample(batch_size)
# sample is a dictionary whose keys are 'obs', 'act', 'rew', 'next_obs', and 'done'
```

## 3 Notes

The author noted that the stride must not be a multiple of the episode length (horizon), in order to avoid sampling similar transitions simultaneously.
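The reason can be seen with a small calculation. This illustration (the numbers and the helper `within_episode_positions` are made up for this example) assumes fixed-length episodes: when the stride is a multiple of the episode length, every index in a batch lands at the same position within its episode, so the batch contains very similar transitions.

```python
episode_len = 100  # assumed fixed episode horizon
batch_size = 5

def within_episode_positions(latest, stride):
    """Position of each sampled transition within its episode."""
    return [(latest - i * stride) % episode_len for i in range(batch_size)]

# stride = 200 is a multiple of 100: all positions coincide
print(within_episode_positions(1000, 200))  # [0, 0, 0, 0, 0]

# stride = 199 is not a multiple: positions spread out
print(within_episode_positions(1000, 199))  # [0, 1, 2, 3, 4]
```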

## 4 Technical Detail

1. E. Rotinov, "Reverse Experience Replay" (2019), arXiv:1910.08780 ↩︎